
Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow #34842


Closed (wants to merge 12 commits)

Conversation


@kylase kylase commented Jun 17, 2020

Infer the file system to be passed to pyarrow based on the path provided.
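The inference this PR relies on can be sketched roughly as follows. This is a minimal stand-in for pandas' internal `get_fs_for_path` helper: the function name `infer_filesystem` and the s3fs usage are illustrative, not the actual implementation.

```python
from urllib.parse import urlparse

def infer_filesystem(path):
    """Return an s3fs filesystem for s3:// paths, else None (local)."""
    if urlparse(path).scheme == "s3":
        # Optional dependency; only imported when an S3 path is seen.
        import s3fs
        return s3fs.S3FileSystem()
    return None
```

pyarrow's `write_to_dataset` treats `filesystem=None` as the local filesystem, so returning `None` for plain paths preserves the existing behaviour for local writes.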


pep8speaks commented Jun 17, 2020

Hello @kylase! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-18 09:04:33 UTC

@simonjayhawkins simonjayhawkins added the IO Parquet parquet, feather label Jun 17, 2020
Contributor

@TomAugspurger TomAugspurger left a comment


Thanks. We'll need tests and a whatsnew for this.

@@ -104,19 +104,26 @@ def write(
from_pandas_kwargs["preserve_index"] = index

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

fs = get_fs_for_path(path)
Contributor


This feels like it should be done at a higher level in the base class and passed through.

Contributor


Call it filesystem.

Author


FastParquetImpl doesn't require this, as it uses get_file_and_filesystem to obtain the filesystem in its read method.

fs is already used in the read method of PyArrowImpl, so for consistency fs is the better choice.

# write_to_dataset does not support a file-like object when
# a directory path is used, so just pass the path string.
if partition_cols is not None:
self.api.parquet.write_to_dataset(
table,
path,
compression=compression,
filesystem=fs,
Contributor


What if the user provides a filesystem?

Author


Thanks for catching this. I realised this after looking at the failing tests.

Perhaps filesystem should be an explicit keyword argument instead of being taken from kwargs?

I also think we need to update the docs to mention that filesystem is required when the system can't obtain the underlying credentials from e.g. ~/.aws/credentials.

@kylase kylase marked this pull request as draft June 17, 2020 12:13
@kylase kylase marked this pull request as ready for review June 17, 2020 17:24
@kylase kylase requested a review from TomAugspurger June 17, 2020 17:24
@kylase kylase changed the title Pass filesystem to parquet.write_table and parquet.write_to_dataset Infer filesystem from path when writing a partitioned DataFrame to remote file systems Jun 18, 2020
@kylase kylase changed the title Infer filesystem from path when writing a partitioned DataFrame to remote file systems Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow Jun 18, 2020
@@ -568,6 +568,24 @@ def test_s3_roundtrip_for_dir(self, df_compat, s3_resource, pa, partition_col):
repeat=1,
)

@td.skip_if_no("s3fs")
@pytest.mark.parametrize("partition_col", [["A"], []])
def test_s3_roundtrip_for_dir_infer_fs(
Member


This looks identical to https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py - could we parametrize instead here?

@@ -104,19 +104,23 @@ def write(
from_pandas_kwargs["preserve_index"] = index

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

fs = kwargs.pop("filesystem", get_fs_for_path(path))
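The `kwargs.pop` pattern above means an explicitly supplied filesystem always wins over inference. Stripped of pandas internals (inference stubbed with a string here), the precedence logic is just:

```python
def resolve_filesystem(path, **kwargs):
    # A user-supplied "filesystem" kwarg takes precedence; otherwise
    # fall back to inference from the path (stubbed as a string here).
    return kwargs.pop("filesystem", "inferred-for:" + path)

print(resolve_filesystem("s3://bucket/data"))                      # inferred-for:s3://bucket/data
print(resolve_filesystem("s3://bucket/data", filesystem="my_fs"))  # my_fs
```

One subtlety of the reviewed line: `dict.pop`'s default is an ordinary argument, so `get_fs_for_path(path)` is evaluated eagerly even when the user did pass a filesystem.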
Member


Perhaps we should wait until https://github.com/pandas-dev/pandas/pull/34266/files#diff-0d7b5a2c72b4dfc11d80afe159d45ff8L153 is merged, since this method is changing. @TomAugspurger might have thoughts.

Author

@kylase kylase Jun 20, 2020


I agree that #34266 is a more robust solution, which should land in v1.1.

This PR could be a patch for 1.0.x, as to_parquet with partition_cols specified does not work as designed. Without partition_cols it works as intended, with no need to specify the file system explicitly.

Member


We originally backported this functionality to 1.0.x, but it was then reverted in #34632, so there is currently no S3 directory functionality on 1.0.x. See the whatsnew: https://pandas.pydata.org/docs/whatsnew/v1.0.5.html#fixed-regressions. So I think this will have to wait for 1.1.

Author


Ok, sounds good. I will alter this after #34266 is merged.

@jorisvandenbossche
Member

#34266 has been merged in the meantime. @kylase if you have time to revisit this?

Author

kylase commented Jul 15, 2020

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

@alimcmaster1
Member

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

> Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

@kylase are you still interested in working on this? The whatsnew will be 1.2 now.

Thanks

Author

kylase commented Aug 21, 2020

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

> Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

> @kylase are you still interested in working on this? The whatsnew will be 1.2 now.
>
> Thanks

I have looked at the tests in 1.1, and there are tests for both an explicit filesystem and an inferred one, so I feel the issue has been resolved.

I will proceed to close this PR and the issue.

@kylase kylase closed this Aug 21, 2020
@kylase kylase deleted the fix/parquet-write-dataset-s3 branch August 21, 2020 15:58
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_parquet writes a partitioned DataFrame to the local filesystem instead of other filesystems (e.g. S3)
6 participants